Regression_Step 05 Modeling

Introduction

Going back to step 01, let's review our goals.

Objectives This project focuses on quantitative data science and mainly aims to perform two goals. One is to predict housing price based on multiple variables such as room size or building types etc. The other is to forecast sales of a real estate company.

These goals is the time series forecast part and is comlpeted in the differet file. Here, we focus on teh former goal, which is to make a financial model for a housnig price.

Using this model, we analyze some housing price transaction in 2018. Again, we are assuming we were in Jan, 2018.

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from library.sb_utils import save_file

# For normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# For metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score

%matplotlib inline

%pdb 0
Automatic pdb calling has been turned OFF

Set parameteres

In [3]:
# Set the number of display
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', -1)  # or 199

pd.options.display.float_format = '{:.4f}'.format
In [4]:
def load_CVmodel(model_path,expected_model_version = '1.0'):
    import pickle
    from sklearn import __version__ as sklearn_version
    
    if os.path.exists(model_path):
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
            print("Expected model version has been loaded")
        if model.version != expected_model_version:
            print("Expected model version doesn't match version loaded")
        if model.sklearn_version != sklearn_version:
            print("Warning: model created under different sklearn version")
        return model

    else:
        print("Expected model not found. Run the model fit")
        
In [5]:
# https://github.com/jbmouret/matplotlib_for_papers/blob/master/src/plot_variance_matplotlib_white.py
def clean_layout(ax):
    # Delete unnecessary lines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    try:
        # Remove frame color
        frame = ax.legend_.get_frame()
        frame.set_facecolor('1.0')
        frame.set_edgecolor('1.0')
    except:
        pass

    # y axis grid line
    ax.grid(axis='y', color="0.9", linestyle='-', linewidth=1)
    ax.set_axisbelow(True)

Load Data

In [6]:
df0 = pd.read_csv('../data/df_data_step3b_EDA.csv')

df_sales = pd.read_csv('../data/time_sales_data_step3b_EDA.csv',index_col=None)

ModelComparison = pd.read_csv('../data/ModelComparison_data_step4_EDA.csv',index_col=None)

# This is necessary for making pipeline
dummy_step04 = pd.read_csv('../data/dummies_in_step4_EDA.csv',index_col=None)
In [7]:
df0.head()
Out[7]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up%
0 https://bj.lianjia.com/chengjiao/BJHD84136716.html BJHD84136716 116.2843 39.9427 1111027378362 2012-01-01 0 618.0000 38868 159.0000 3 2 2 1 4 2001 1 2 0.5000 1 0 1 8 2012 1 0 6 N -0.0423 -0.0212 0.0424
1 https://bj.lianjia.com/chengjiao/BJCY84233046.html BJCY84233046 116.5054 39.9391 1111027379229 2012-01-01 0 95.0000 20145 47.1600 1 1 1 1 4 1995 1 2 0.5000 0 0 0 7 2012 1 3 6 N -0.0423 -0.0212 0.0424
2 https://bj.lianjia.com/chengjiao/BJCY84129495.html BJCY84129495 116.4551 39.9315 1111027376792 2012-01-01 0 223.0000 28916 77.1200 3 1 1 1 4 1988 1 4 0.3330 0 1 1 7 2012 1 5 6 N -0.0423 -0.0212 0.0424
3 https://bj.lianjia.com/chengjiao/BJCY84131603.html BJCY84131603 116.4595 39.9540 1111027381384 2012-01-01 0 164.4000 21885 75.1200 3 1 1 1 3 1984 1 6 0.2500 1 1 1 7 2012 1 3 16 N -0.0423 -0.0212 0.0424
4 https://bj.lianjia.com/chengjiao/BJHD56417864.html BJHD56417864 116.3815 39.9778 1111027381832 2012-01-01 0 107.0000 23363 45.8000 1 1 1 1 3 1996 1 6 0.0710 1 1 0 8 2012 1 5 9 N -0.0423 -0.0212 0.0424

The financial models to use

In [8]:
# check the best stack model and the models that used in the stack model
ModelComparison.iloc[[7,9,10]]
Out[8]:
Model name MAE RMSE Explained_variance_score Best para
7 RandomForestRegressor w/ BayesSearchCV 4558.5951 7073.6714 0.9149 OrderedDict([('rf__max_depth', 100), ('rf__min_samples_leaf', 1), ('rf__min_samples_split', 2), ('rf__n_estimators', 600)])
9 GradientBoostingRegressor w/ BayesSearchCV 4196.3939 6513.0618 0.9278 OrderedDict([('gb__learning_rate', 0.11413168586297671), ('gb__max_depth', 8), ('gb__n_estimators', 1000)])
10 Stacking ensemble 4111.6664 6431.3238 0.9296 NaN

The model we use is Stacking ensemble, which is based on the other two models.

Add some additional transformer for a pipeline

Imoprt necessary libraries

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator

Data to use

As we are assuming we were in Jan, 2016, the data is set to only 2017.

Drop unnecessary columns

In [10]:
# Drop the columns unnecessary
class DropColumns(BaseEstimator):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X: pd.DataFrame, y=None):
        X = X.drop(['url','id','Cid','tradeTime', 'totalPrice'],axis=1) # The model was only trained on Metropolitan    
        return X
    

Dummy columns

The process is the same as step04

In [11]:
dummy_step04.columns
Out[11]:
Index(['buildingType_2', 'buildingType_3', 'buildingType_4', 'buildingType_5',
       'elevator_1', 'renovationCondition_2', 'renovationCondition_3',
       'renovationCondition_4', 'buildingStructure_2', 'buildingStructure_3',
       'buildingStructure_4', 'buildingStructure_5', 'buildingStructure_6',
       'fiveYearsProperty_1', 'subway_1', 'district_10', 'district_11',
       'district_12', 'district_13', 'district_2', 'district_3', 'district_4',
       'district_5', 'district_6', 'district_7', 'district_8', 'district_9',
       'floorType_1', 'floorType_2', 'floorType_3', 'floorType_4',
       'floorType_5', 'DOM_cat_1', 'DOM_cat_2'],
      dtype='object')
In [12]:
class DummyColumns(BaseEstimator):
    
    def _init__(self):
        pass
    
    def fit(self,X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # Set feature values to make dummies for
        ls_label = ['buildingType','elevator','renovationCondition' , 'buildingStructure', 'fiveYearsProperty', 'subway', 'district', 'floorType', 'DOM_cat']
        dummies_columns = dummy_step04.columns
        
        dummies = pd.get_dummies(X[ls_label].astype(str),
#                                  drop_first=True
                                )
        dummies = dummies.reindex(columns = dummies_columns, fill_value=0)
        res = pd.concat([X, dummies], axis=1)
        res = res.drop(ls_label, axis=1)
        
        return res

Build a Pipeline

Import the best model in step04

In [13]:
model_path = '../models/stack_opt.pkl'
stack = load_CVmodel(model_path)
Expected model version has been loaded

Define the pipeline

In [14]:
from sklearn.pipeline import make_pipeline,Pipeline

# structure_pipeline = Pipeline([
#                 ("DropColumns", DropColumns()),
#                 ("DummyColumns", DummyColumns()),
#                 ])

financialModel_1_pipe = Pipeline([
                ("DropColumns", DropColumns()),
                ("DummyColumns", DummyColumns()),
                ("stack",stack),
                ])

Make a prediction

In [15]:
def make_prediction(df,model_pipeline, model_name:str):
    
    # Create train and test data set
    # This is for modeling, which is the data in 2018
    df_train = df.drop(['price'],axis=1)
    df_test = df[['price']]
    
#     data = structure_pipeline.transform(df_train)   # Save the transformation to a separate df
    predictions = model_pipeline.predict(df_train).reshape(-1,1)   # Make the prediction on the original df
    
    # Append the prediction to the separate transformed data
#     data[f'preidcitons_price'] = y_scaler.inverse_transform(predictions)
    
    data = df.copy()
    data['preidcitons_price'] = predictions
    data.rename(columns={'preidcitons_price':f'modelPrice_{model_name}'}, inplace=True)
#     data['price'] = df_test
    return data
In [16]:
price_prediction_stack = make_prediction(df0[df0.year==2018],financialModel_1_pipe, 'stack')
In [17]:
price_prediction_stack.tail()
Out[17]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack
312626 https://bj.lianjia.com/chengjiao/101101989805.html 101101989805 116.5854 39.9119 1111027377521 2018-01-22 98 261.0000 37244 70.0800 2 1 1 1 1 2010 4 6 0.2000 1 1 1 7 2018 1 3 22 L 0.0074 -0.0064 0.0525 43307.7790
312627 https://bj.lianjia.com/chengjiao/101102201096.html 101102201096 116.2203 40.0436 1111027376008 2018-01-22 112 830.0000 51096 162.4400 4 2 1 2 4 2009 4 6 0.5000 1 1 0 8 2018 1 5 8 L 0.0074 -0.0064 0.0525 59094.0614
312628 https://bj.lianjia.com/chengjiao/101102219310.html 101102219310 116.3464 40.0014 1111027380544 2018-01-24 1 460.0000 76159 60.4000 2 1 1 1 4 1986 3 2 0.5000 0 1 1 8 2018 1 3 6 L 0.0074 -0.0064 0.0525 88311.6216
312629 https://bj.lianjia.com/chengjiao/101102448228.html 101102448228 116.4415 39.8721 1111027374184 2018-01-26 0 410.0000 63468 64.6000 2 1 1 1 1 1995 1 6 0.2500 1 0 1 2 2018 1 5 25 L 0.0074 -0.0064 0.0525 67889.6779
312630 https://bj.lianjia.com/chengjiao/101102321677.html 101102321677 116.4170 40.0068 1111027375671 2018-01-28 53 345.0000 71577 48.2000 1 1 1 1 4 1994 4 2 0.5000 0 0 1 7 2018 1 5 6 L 0.0074 -0.0064 0.0525 77901.4985
In [18]:
# Difference between price and model price
def calModelGap(df):
    df['gap'] = df['price']-df['modelPrice_stack']
    
calModelGap(price_prediction_stack)
sns.histplot(price_prediction_stack['gap'],kde=True)
sns.despine(left=True);

This tells price in actual transaction is more inexpensive than listed price.

Let's check the the worst 5.

In [19]:
price_prediction_stack.sort_values(by='gap',ascending=True,)[['price','modelPrice_stack','gap']].head()
Out[19]:
price modelPrice_stack gap
312572 20128 58948.2200 -38820.2200
312518 84000 108788.5022 -24788.5022
312581 90266 114469.8891 -24203.8891
312499 102593 126578.3969 -23985.3969
312580 92575 112260.5957 -19685.5957

Problem 1: Houses with too low price

These houses will be found by making use of standard deviation.

In [20]:
sigma = price_prediction_stack['gap'].std()
sigma
Out[20]:
6170.659098860616
In [21]:
price_prediction_stack['priceTooLow'] = price_prediction_stack['gap'].apply(lambda x: 1 if x<-2*sigma  else 0)

tooLow = price_prediction_stack[price_prediction_stack.priceTooLow==1]
In [22]:
tooLow.head()
Out[22]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack gap priceTooLow
312420 https://bj.lianjia.com/chengjiao/101102428419.html 101102428419 116.4172 39.9662 1111027380878 2018-01-01 30 620.0000 83311 74.4200 3 1 1 1 4 1976 3 2 0.5000 0 1 1 1 2018 1 5 5 M 0.0074 -0.0064 0.0525 98291.2766 -14980.2766 1
312422 https://bj.lianjia.com/chengjiao/101102212168.html 101102212168 116.3306 39.8920 1111027380537 2018-01-01 220 405.0000 70143 57.7400 2 1 1 1 4 1990 3 2 0.3330 0 0 1 10 2018 1 5 6 L 0.0074 -0.0064 0.0525 83152.8371 -13009.8371 1
312432 https://bj.lianjia.com/chengjiao/101101811228.html 101101811228 116.3523 40.0624 1111027382146 2018-01-01 58 660.0000 64328 102.6000 3 2 1 1 1 1996 3 6 0.2500 1 1 1 8 2018 1 5 22 L 0.0074 -0.0064 0.0525 78807.7570 -14479.7570 1
312433 https://bj.lianjia.com/chengjiao/101102449681.html 101102449681 116.2324 40.2355 1111027380837 2018-01-01 0 131.0000 39697 33.0000 1 1 1 1 4 1984 1 2 0.3330 0 0 0 6 2018 1 1 5 S 0.0074 -0.0064 0.0525 55479.8327 -15782.8327 1
312499 https://bj.lianjia.com/chengjiao/101102388385.html 101102388385 116.3586 39.9321 1111027376241 2018-01-02 49 415.5000 102593 40.5000 1 1 1 1 1 1992 3 2 0.2000 1 1 1 10 2018 1 5 20 L 0.0074 -0.0064 0.0525 126578.3969 -23985.3969 1
In [23]:
tooLow_new = tooLow.iloc[:,:31].copy()
In [24]:
tooLow_new.renovationCondition = int(4)
In [25]:
tooLow_new.head()
Out[25]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up%
312420 https://bj.lianjia.com/chengjiao/101102428419.html 101102428419 116.4172 39.9662 1111027380878 2018-01-01 30 620.0000 83311 74.4200 3 1 1 1 4 1976 4 2 0.5000 0 1 1 1 2018 1 5 5 M 0.0074 -0.0064 0.0525
312422 https://bj.lianjia.com/chengjiao/101102212168.html 101102212168 116.3306 39.8920 1111027380537 2018-01-01 220 405.0000 70143 57.7400 2 1 1 1 4 1990 4 2 0.3330 0 0 1 10 2018 1 5 6 L 0.0074 -0.0064 0.0525
312432 https://bj.lianjia.com/chengjiao/101101811228.html 101101811228 116.3523 40.0624 1111027382146 2018-01-01 58 660.0000 64328 102.6000 3 2 1 1 1 1996 4 6 0.2500 1 1 1 8 2018 1 5 22 L 0.0074 -0.0064 0.0525
312433 https://bj.lianjia.com/chengjiao/101102449681.html 101102449681 116.2324 40.2355 1111027380837 2018-01-01 0 131.0000 39697 33.0000 1 1 1 1 4 1984 4 2 0.3330 0 0 0 6 2018 1 1 5 S 0.0074 -0.0064 0.0525
312499 https://bj.lianjia.com/chengjiao/101102388385.html 101102388385 116.3586 39.9321 1111027376241 2018-01-02 49 415.5000 102593 40.5000 1 1 1 1 1 1992 4 2 0.2000 1 1 1 10 2018 1 5 20 L 0.0074 -0.0064 0.0525
In [26]:
tooLow_new_reno4 = make_prediction(tooLow_new,financialModel_1_pipe, 'stack')
In [27]:
tooLow_new_reno4[['renovationCondition','modelPrice_stack']]
len(tooLow_new_reno4[['renovationCondition','modelPrice_stack']])
Out[27]:
21
In [28]:
tooLow['reno_4'] = tooLow_new_reno4['modelPrice_stack']
In [29]:
table1 = tooLow[['renovationCondition','modelPrice_stack','reno_4']]
In [30]:
# Expected increase in total sales
tooLow_new_reno4['modelPrice_stack'].sum() - tooLow_new_reno4['price'].sum()
Out[30]:
383436.85648112115
In [31]:
# calculate the gap between before-price and after-price
table1['gap'] = table1['reno_4']-table1['modelPrice_stack']
In [32]:
# filter out the houses which prices can be improved by hard floor renovation
temp = table1[table1['gap'] > table1['gap'].std()]
In [65]:
# rename the table
tempp = temp.rename(columns={
                    'renovationCondition':'Condition before renovation',
                    'modelPrice_stack':'price before',
                    'reno_4':'price after',
                    
                    
                    
                    })

tempp['percent increase'] = tempp['gap']/tempp['price before']*100
In [66]:
# 
tempp
Out[66]:
Condition before renovation price before price after gap percent increase
312544 1 86310.5519 88018.6845 1708.1327 1.9791
312547 1 71396.3293 73315.1875 1918.8581 2.6876
312549 3 79108.8329 80763.9368 1655.1039 2.0922
312580 3 112260.5957 116142.8875 3882.2918 3.4583
312581 3 114469.8891 115835.9889 1366.0998 1.1934
312609 1 116670.0002 118708.5940 2038.5938 1.7473
312624 3 103665.3564 107134.1602 3468.8038 3.3462
In [67]:
# 
tempp.describe()
Out[67]:
Condition before renovation price before price after gap percent increase
count 7.0000 7.0000 7.0000 7.0000 7.0000
mean 2.1429 97697.3651 99988.4913 2291.1263 2.3577
std 1.0690 18510.9003 18876.5085 976.4876 0.8406
min 1.0000 71396.3293 73315.1875 1366.0998 1.1934
25% 1.0000 82709.6924 84391.3106 1681.6183 1.8632
50% 3.0000 103665.3564 107134.1602 1918.8581 2.0922
75% 3.0000 113365.2424 115989.4382 2753.6988 3.0169
max 3.0000 116670.0002 118708.5940 3882.2918 3.4583

Q1: How does renovation contribute to increase in unit price ('price')?

Check how gap distributes against renovationCategory

In [36]:
sns.boxplot(x="renovationCondition", y="gap", data=price_prediction_stack)
sns.stripplot(x="renovationCondition", y="gap", data=price_prediction_stack, marker="o", alpha=0.3, color="black", order=[1,2,3,4]);
sns.despine(left=True);

Definition of reonvation condition category

  • renovationCondition == 1 - "Other"
  • renovationCondition == 2 ~ "Rough"
  • renovationCondition == 3 ~ "Simplicity"
  • renovationCondition == 4 ~ "Hardcover"
In [37]:
sns.boxplot(x="renovationCondition", y="modelPrice_stack", data=price_prediction_stack,)
sns.stripplot(x="renovationCondition", y="modelPrice_stack", data=price_prediction_stack, marker="o", alpha=0.3, color="black", order=[1,2,3,4]);
sns.despine(left=True);

Q2 How different are Hardcover renovationa (cat4) and Simplicity renovation (cat3)?

According to this, there seem to be no difference between Hardcover renovationa (cat4) and Simplicity renovation (cat3). If this is true, there is no merit to spend extra money for renovation.

Next, we statistically test this.

In [38]:
# Create a column called `Permutation1`, and assign to it the result of permuting (shuffling) the price column
price_prediction_stack['Permutation1'] = np.random.permutation(price_prediction_stack.price)

# Call the describe() method on our permutation grouped by 'platform'. 
price_prediction_stack.groupby('renovationCondition')[['Permutation1']].describe()


price_prediction_stack.groupby('renovationCondition')[['modelPrice_stack','Permutation1']].describe().T
Out[38]:
renovationCondition 1 2 3 4
modelPrice_stack count 27.0000 3.0000 75.0000 116.0000
mean 62790.2013 85143.7201 66682.8790 64544.8123
std 24143.8350 48543.2395 21695.3674 23287.8676
min 31056.3548 52911.6294 30265.9049 30234.8589
25% 44957.7321 57228.2974 49769.5435 47173.4739
50% 55479.8327 61544.9653 64129.9297 60388.3475
75% 78978.0442 101259.7654 78958.2949 75802.5472
max 125663.0349 140974.5655 126578.3969 132175.4832
Permutation1 count 27.0000 3.0000 75.0000 116.0000
mean 63112.5556 51662.0000 60146.0933 58879.3448
std 24293.3946 17638.3102 19986.6952 22991.5126
min 24771.0000 38704.0000 32773.0000 20128.0000
25% 43335.5000 41618.5000 45209.0000 42405.0000
50% 61218.0000 44533.0000 55917.0000 52781.5000
75% 80808.0000 58141.0000 69002.0000 71589.7500
max 122165.0000 71749.0000 142120.0000 124690.0000
In [39]:
price_prediction_stack.groupby('renovationCondition')[['Permutation1']].mean()
Out[39]:
Permutation1
renovationCondition
1 63112.5556
2 51662.0000
3 60146.0933
4 58879.3448
In [40]:
renoCat1 = price_prediction_stack[price_prediction_stack.renovationCondition==1].modelPrice_stack
renoCat2 = price_prediction_stack[price_prediction_stack.renovationCondition==2].modelPrice_stack
renoCat3 = price_prediction_stack[price_prediction_stack.renovationCondition==3].modelPrice_stack
renoCat4 = price_prediction_stack[price_prediction_stack.renovationCondition==4].modelPrice_stack

Check the distribution of the data

In [41]:
from scipy import stats

renoCat3_normal = stats.normaltest(renoCat3)
renoCat3_normal
Out[41]:
NormaltestResult(statistic=7.441407966140893, pvalue=0.024216913546271368)
In [42]:
renoCat4_normal = stats.normaltest(renoCat4)
renoCat4_normal
Out[42]:
NormaltestResult(statistic=21.167788072776986, pvalue=2.532055512935648e-05)

Both are not normal.

Hypothesis formulation

Our Null hypothesis is just:

H null: the observed difference in the mean Hardcover renovationa (cat4) and mean Simplicity renovation (cat3) is due to chance (and thus not due to the renovation).

H alternative: the observed difference is not due to chance (and is actually due to renovation)

We're also going to pick a significance level of 0.05.

Permutation test

Is there wny difference among categories? Here, we perform permutaiton test. The data to use is modelPrice_stack predicted by the financial model.

In [43]:
sns.histplot(renoCat3,kde=True,color='r')
sns.histplot(renoCat4,kde=True,color='b')
sns.despine(left=True);
In [44]:
# Another way per the Datacamp in sec.11.3

np.random.seed(42)
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1)-np.mean(data_2)

    return diff

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

def permutationTest(a,b):
    # Compute difference of mean impact force from experiment: empirical_diff_means
    empirical_diff_means = diff_of_means(a,b)

    # Draw 10,000 permutation replicates: perm_replicates
    perm_replicates = draw_perm_reps(a,b,diff_of_means, size=10000)

    # Compute p-value: p
    # two-tailed
    p = np.sum(perm_replicates >= empirical_diff_means)/ len(perm_replicates)

    # Print the result
    print('empirical_diff_means ',empirical_diff_means)
    print('p-value =', p)
    
    
    plt.hist(perm_replicates)
    plt.axvline(empirical_diff_means, color='r')
In [45]:
permutationTest(renoCat3,renoCat4)
sns.despine(left=True);
empirical_diff_means  2138.0667016161024
p-value = 0.2631

There is no difference between 3 and 4 renovationCondition. Based on this result, we cannot deny the null hypoothesis that the distributions are same. The benefit of renovation is not significant in terms of price appreciation.

T-test for the means of two independent samples

In [46]:
import random
from scipy.stats import ttest_ind

ttest,pval = ttest_ind(renoCat3,renoCat4, equal_var = False)
print("ttest",ttest)
print('p value',format(pval, '.70f'))

if pval <0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")
ttest 0.6460900335843099
p value 0.5191130687439893254264688948751427233219146728515625000000000000000000
we accept null hypothesis

Problem 2: How to increase price by 15% by any additional changes within the next 3 months?

One of the methods to increase a housing price is renovation. How much appreciation in price will be expected if Hardcover renovation (cat4) are applied to all houses? Typical renovation is done within one month, and only concern here is to check if the improvement in price exceeds by 15% or not.

In [47]:
# Make a dataframe for this question
dfQ2 = df0[df0.year==2018]
In [48]:
# Change therenovationCondition to 4
dfQ2.renovationCondition = 4
dfQ2.head()
Out[48]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up%
312410 https://bj.lianjia.com/chengjiao/101102425915.html 101102425915 116.6553 40.1392 1111027381518 2018-01-01 3 223.0000 40022 55.7200 1 1 1 1 4 1989 4 2 0.5000 0 1 1 13 2018 1 0 6 M 0.0074 -0.0064 0.0525
312411 https://bj.lianjia.com/chengjiao/101102287189.html 101102287189 116.4358 39.7981 1111027379000 2018-01-01 106 395.0000 44533 88.7000 2 1 1 1 3 2010 4 6 0.2500 1 0 0 2 2018 1 1 10 L 0.0074 -0.0064 0.0525
312412 https://bj.lianjia.com/chengjiao/101102356574.html 101102356574 116.4776 39.9537 1111027382733 2018-01-01 15 332.0000 71032 46.7400 1 1 1 1 4 1996 4 2 0.5000 0 1 1 7 2018 1 3 6 L 0.0074 -0.0064 0.0525
312413 https://bj.lianjia.com/chengjiao/101102281973.html 101102281973 116.6846 39.9111 1111027381219 2018-01-01 48 350.0000 47164 74.2100 1 1 1 1 3 2003 4 6 0.5000 1 1 0 11 2018 1 5 12 L 0.0074 -0.0064 0.0525
312414 https://bj.lianjia.com/chengjiao/101102394663.html 101102394663 116.5551 39.9574 1111041166357 2018-01-01 20 407.0000 45531 89.3900 3 1 1 1 4 2011 4 6 0.2500 0 1 0 7 2018 1 5 21 L 0.0074 -0.0064 0.0525
In [49]:
# Predict new prices
Q2 = make_prediction(dfQ2,financialModel_1_pipe, 'stack')
Q2.rename(columns={'modelPrice_stack':'price_renovation3to4'}, inplace=True)
price_prediction_stack = pd.concat([price_prediction_stack,Q2.iloc[:,-1:]],axis=1)
In [50]:
price_prediction_stack['renovTo4_pct'] =\
(price_prediction_stack.price_renovation3to4-price_prediction_stack.modelPrice_stack)/price_prediction_stack.modelPrice_stack*100

price_prediction_stack.head()
Out[50]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack gap priceTooLow Permutation1 price_renovation3to4 renovTo4_pct
312410 https://bj.lianjia.com/chengjiao/101102425915.html 101102425915 116.6553 40.1392 1111027381518 2018-01-01 3 223.0000 40022 55.7200 1 1 1 1 4 1989 3 2 0.5000 0 1 1 13 2018 1 0 6 M 0.0074 -0.0064 0.0525 41683.9963 -1661.9963 0 41517 41592.6232 -0.2192
312411 https://bj.lianjia.com/chengjiao/101102287189.html 101102287189 116.4358 39.7981 1111027379000 2018-01-01 106 395.0000 44533 88.7000 2 1 1 1 3 2010 4 6 0.2500 1 0 0 2 2018 1 1 10 L 0.0074 -0.0064 0.0525 45648.7808 -1115.7808 0 115750 45648.7808 0.0000
312412 https://bj.lianjia.com/chengjiao/101102356574.html 101102356574 116.4776 39.9537 1111027382733 2018-01-01 15 332.0000 71032 46.7400 1 1 1 1 4 1996 4 2 0.5000 0 1 1 7 2018 1 3 6 L 0.0074 -0.0064 0.0525 75790.8764 -4758.8764 0 39015 75790.8764 0.0000
312413 https://bj.lianjia.com/chengjiao/101102281973.html 101102281973 116.6846 39.9111 1111027381219 2018-01-01 48 350.0000 47164 74.2100 1 1 1 1 3 2003 4 6 0.5000 1 1 0 11 2018 1 5 12 L 0.0074 -0.0064 0.0525 53840.2484 -6676.2484 0 89204 53840.2484 0.0000
312414 https://bj.lianjia.com/chengjiao/101102394663.html 101102394663 116.5551 39.9574 1111041166357 2018-01-01 20 407.0000 45531 89.3900 3 1 1 1 4 2011 4 6 0.2500 0 1 0 7 2018 1 5 21 L 0.0074 -0.0064 0.0525 51993.8408 -6462.8408 0 54602 51993.8408 0.0000
In [51]:
f =(price_prediction_stack.renovationCondition==1)|(price_prediction_stack.renovationCondition==2)|(price_prediction_stack.renovationCondition==3)
In [52]:
price_prediction_stack[f][['renovTo4_pct']].describe().T
Out[52]:
count mean std min 25% 50% 75% max
renovTo4_pct 105.0000 2.0398 1.9167 -3.7066 1.0145 1.9791 2.7978 10.4713

The expected appreciation is as following.

  • Median: 2.0%
  • Mean: 2.0%
  • Std: 1.9%

Based on these, standard deviation is same level as its mean. According to this model, the average improvement in price by renovation is 2.0%. Therefore, if renovation cost is less than sales price times this 2.0%, the renovation may be worth an invest, but it si difficut to say that because the standard deviation is as high as 2.0.

Q4 Which house should be renovated?

Let's find the houses that should be renovated with teo sigma confidence.

In [53]:
sigma = price_prediction_stack[f].renovTo4_pct.std()
sigma
Out[53]:
1.9166712855484855
In [54]:
# filter out the only price_renovation3to4 that have significant level of positive percentage increase (price_renovation3to4) by more than two sigma.
price_prediction_stack[price_prediction_stack.renovTo4_pct>sigma*2]
Out[54]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack gap priceTooLow Permutation1 price_renovation3to4 renovTo4_pct
312454 https://bj.lianjia.com/chengjiao/101101748885.html 101101748885 116.3534 39.8756 1111027378762 2018-01-01 10 646.0000 61937 104.3000 3 1 1 2 1 2001 3 6 0.2500 1 0 0 2 2018 1 5 30 L 0.0074 -0.0064 0.0525 65895.8819 -3958.8819 0 46910 68556.7300 4.0380
312462 https://bj.lianjia.com/chengjiao/101102304111.html 101102304111 116.4332 40.0087 1111027377294 2018-01-02 28 624.5000 94593 66.0200 1 2 1 1 4 2006 1 6 0.6670 1 1 1 7 2018 1 3 15 L 0.0074 -0.0064 0.0525 96324.9894 -1731.9894 0 71328 101760.7891 5.6432
312483 https://bj.lianjia.com/chengjiao/101102344435.html 101102344435 116.5739 39.9288 1111027376842 2018-01-02 27 407.0000 44849 90.7500 2 1 1 1 3 2006 3 6 0.6670 1 1 1 7 2018 1 3 18 L 0.0074 -0.0064 0.0525 50672.6315 -5823.6315 0 53982 53254.1890 5.0946
312512 https://bj.lianjia.com/chengjiao/101101498405.html 101101498405 116.5044 39.9294 1111027376016 2018-01-02 77 284.0000 55350 51.3100 1 1 1 1 4 1994 2 2 0.3330 0 1 1 7 2018 1 0 6 L 0.0074 -0.0064 0.0525 61544.9653 -6194.9653 0 71749 64495.1746 4.7936
312524 https://bj.lianjia.com/chengjiao/101102118625.html 101102118625 116.3229 40.0886 1111027379919 2018-01-02 39 430.0000 33180 129.6000 3 2 1 2 4 2008 3 2 0.5000 0 1 0 6 2018 1 5 6 L 0.0074 -0.0064 0.0525 38054.1988 -4874.1988 0 37314 39536.5144 3.8953
312526 https://bj.lianjia.com/chengjiao/101102070706.html 101102070706 116.4046 39.9785 1111027375686 2018-01-02 169 373.0000 65600 56.8600 2 1 1 1 4 1986 3 2 0.5000 0 0 1 7 2018 1 5 6 L 0.0074 -0.0064 0.0525 73257.1512 -7657.1512 0 121191 79093.3696 7.9668
312545 https://bj.lianjia.com/chengjiao/101102088408.html 101102088408 116.4689 39.8825 1111027378493 2018-01-03 134 325.5000 55804 58.3300 1 1 1 1 1 1989 1 6 0.2500 1 1 1 7 2018 1 5 17 L 0.0074 -0.0064 0.0525 57716.6167 -1912.6167 0 41822 60098.5980 4.1270
312555 https://bj.lianjia.com/chengjiao/101102437883.html 101102437883 116.3421 39.7497 1111027378968 2018-01-03 20 222.0000 39728 55.8800 2 1 1 1 4 1997 3 2 0.5000 0 0 1 4 2018 1 2 6 M 0.0074 -0.0064 0.0525 48897.9008 -9169.9008 0 55882 50837.9111 3.9675
312558 https://bj.lianjia.com/chengjiao/101102403623.html 101102403623 116.4249 39.9898 1111027378004 2018-01-03 31 1002.0000 54602 183.5100 4 1 1 2 1 1995 3 6 0.3330 1 1 1 7 2018 1 5 19 L 0.0074 -0.0064 0.0525 55707.9042 -1105.9042 0 62131 61541.2313 10.4713
312592 https://bj.lianjia.com/chengjiao/101102414473.html 101102414473 116.4205 39.9979 1111027375605 2018-01-03 18 585.0000 69710 83.9200 3 1 1 1 1 1991 3 6 0.2500 1 0 0 7 2018 1 3 25 L 0.0074 -0.0064 0.0525 76840.7105 -7130.7105 0 45974 80425.2783 4.6649
312601 https://bj.lianjia.com/chengjiao/101102217765.html 101102217765 116.3349 39.7440 1111027378789 2018-01-03 23 270.0000 25902 104.2400 2 1 1 1 3 2006 1 6 0.2730 1 1 1 4 2018 1 0 19 L 0.0074 -0.0064 0.0525 36958.1370 -11056.1370 0 92575 38730.1515 4.7947
312605 https://bj.lianjia.com/chengjiao/101102278923.html 101102278923 116.4575 39.8764 1111027375273 2018-01-03 21 479.0000 56923 84.1500 2 1 1 1 1 2009 3 6 0.2500 1 0 1 7 2018 1 0 28 L 0.0074 -0.0064 0.0525 45196.4565 11726.5435 0 92269 48964.5491 8.3371
312611 https://bj.lianjia.com/chengjiao/101102434018.html 101102434018 116.4231 39.6764 1111027381935 2018-01-03 0 219.0000 24771 88.4100 2 1 1 1 4 2012 1 6 0.3330 0 0 0 4 2018 1 5 6 M 0.0074 -0.0064 0.0525 31056.3548 -6285.3548 0 41689 32351.5915 4.1706
312623 https://bj.lianjia.com/chengjiao/101102443459.html 101102443459 116.6474 39.8963 1111027380847 2018-01-03 1 405.0000 42718 94.8100 2 2 1 2 4 2004 3 2 0.5000 0 1 0 11 2018 1 0 6 M 0.0074 -0.0064 0.0525 50234.4308 -7516.4308 0 65600 52215.9362 3.9445

Those houses are financial-modelically worthwhile investment of renovation.

Q5 Which features can have significant benefit by renovation?

In [55]:
# Add a new column of boolean if renovTo4_pct is positive or not. 

def posiNega(x):
    if x > 0:
        return 1
    else:
        return 0

price_prediction_stack['renoPlus'] = price_prediction_stack['renovTo4_pct'].apply(posiNega)
price_prediction_stack.head()
Out[55]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack gap priceTooLow Permutation1 price_renovation3to4 renovTo4_pct renoPlus
312410 https://bj.lianjia.com/chengjiao/101102425915.html 101102425915 116.6553 40.1392 1111027381518 2018-01-01 3 223.0000 40022 55.7200 1 1 1 1 4 1989 3 2 0.5000 0 1 1 13 2018 1 0 6 M 0.0074 -0.0064 0.0525 41683.9963 -1661.9963 0 41517 41592.6232 -0.2192 0
312411 https://bj.lianjia.com/chengjiao/101102287189.html 101102287189 116.4358 39.7981 1111027379000 2018-01-01 106 395.0000 44533 88.7000 2 1 1 1 3 2010 4 6 0.2500 1 0 0 2 2018 1 1 10 L 0.0074 -0.0064 0.0525 45648.7808 -1115.7808 0 115750 45648.7808 0.0000 0
312412 https://bj.lianjia.com/chengjiao/101102356574.html 101102356574 116.4776 39.9537 1111027382733 2018-01-01 15 332.0000 71032 46.7400 1 1 1 1 4 1996 4 2 0.5000 0 1 1 7 2018 1 3 6 L 0.0074 -0.0064 0.0525 75790.8764 -4758.8764 0 39015 75790.8764 0.0000 0
312413 https://bj.lianjia.com/chengjiao/101102281973.html 101102281973 116.6846 39.9111 1111027381219 2018-01-01 48 350.0000 47164 74.2100 1 1 1 1 3 2003 4 6 0.5000 1 1 0 11 2018 1 5 12 L 0.0074 -0.0064 0.0525 53840.2484 -6676.2484 0 89204 53840.2484 0.0000 0
312414 https://bj.lianjia.com/chengjiao/101102394663.html 101102394663 116.5551 39.9574 1111041166357 2018-01-01 20 407.0000 45531 89.3900 3 1 1 1 4 2011 4 6 0.2500 0 1 0 7 2018 1 5 21 L 0.0074 -0.0064 0.0525 51993.8408 -6462.8408 0 54602 51993.8408 0.0000 0
In [56]:
features = ['followers',
       'totalPrice', 'square', 'livingRoom', 'drawingRoom', 'kitchen',
       'bathRoom', 'buildingType', 'constructionTime', 
       'buildingStructure', 'ladderRatio', 'elevator', 'fiveYearsProperty',
       'subway', 'district', 'year', 'month', 'floorType', 'floorHeight',
       'DOM_cat', '10yrBond_up%', 'hmi_up%', 'sse_up%', 'modelPrice_stack',
       'renoPlus']
# sns.pairplot(price_prediction_stack[features], hue="renoPlus");

Using pairplot, it is found that the columns, 'livingRoom', 'buildingType', 'followers', had more difference between renoPlus boolean than the others.

Next, check the histgrams of these columns to find which houses are better to renovate.

In [57]:
f, ax = plt.subplots(1, 3, figsize=(12, 6), sharex=False)

for i,_ in enumerate(['livingRoom', 'buildingType', 'followers']):
    sns.histplot(data=price_prediction_stack, x=_, multiple="dodge", hue="renoPlus",kde=True,ax=ax[i])
    sns.despine(left=True)
In [58]:
a = price_prediction_stack[price_prediction_stack.renoPlus==0].livingRoom
b = price_prediction_stack[price_prediction_stack.renoPlus==1].livingRoom
permutationTest(a,b)
sns.despine(left=True);
empirical_diff_means  0.05600000000000005
p-value = 0.319
In [59]:
a = price_prediction_stack[price_prediction_stack.renoPlus==0].buildingType
b = price_prediction_stack[price_prediction_stack.renoPlus==1].buildingType
permutationTest(a,b)
sns.despine(left=True);
empirical_diff_means  -0.0129999999999999
p-value = 0.5526
In [60]:
a = price_prediction_stack[price_prediction_stack.renoPlus==0].followers
b = price_prediction_stack[price_prediction_stack.renoPlus==1].followers
permutationTest(a,b)
sns.despine(left=True);
empirical_diff_means  20.08166666666667
p-value = 0.0072

We conclude that renovation will bring significant difference in followers.

Q6 How different are the prices between Listed price(price) vs Predicted price(modelPrice) ?

In [61]:
price_prediction_stack[['gap']].describe().T
Out[61]:
count mean std min 25% 50% 75% max
gap 221.0000 -5607.2161 6170.6591 -38820.2200 -8531.2841 -5125.8261 -1912.6167 26304.9237
In [62]:
sig = np.std(price_prediction_stack['gap'])
In [63]:
# price - modelPrice_stack is too low(-)
len(price_prediction_stack[price_prediction_stack.gap<-2*sig].sort_values(by='gap'))
                                                                    
Out[63]:
21
In [64]:
# price - modelPrice_stack is too high(+)
# houses which listed prices are much lower thatn predicted model
price_prediction_stack[price_prediction_stack.gap>1.9*sig].sort_values(by='gap',ascending = False).head(10)
Out[64]:
url id Lng Lat Cid tradeTime followers totalPrice price square livingRoom drawingRoom kitchen bathRoom buildingType constructionTime renovationCondition buildingStructure ladderRatio elevator fiveYearsProperty subway district year month floorType floorHeight DOM_cat 10yrBond_up% hmi_up% sse_up% modelPrice_stack gap priceTooLow Permutation1 price_renovation3to4 renovTo4_pct renoPlus
312469 https://bj.lianjia.com/chengjiao/101102086891.html 101102086891 116.4621 39.9238 1111027380611 2018-01-02 27 2260.0000 124690 181.2500 3 1 1 2 3 2008 4 6 0.7500 1 1 1 7 2018 1 5 25 L 0.0074 -0.0064 0.0525 98385.0763 26304.9237 0 26184 98385.0763 0.0000 0
312576 https://bj.lianjia.com/chengjiao/101101814992.html 101101814992 116.5038 39.9307 1111027374762 2018-01-03 82 1460.0000 80896 180.4800 3 1 1 2 1 2011 4 6 0.1670 0 0 0 7 2018 1 5 32 L 0.0074 -0.0064 0.0525 68910.0095 11985.9905 0 36747 68910.0095 0.0000 0
312605 https://bj.lianjia.com/chengjiao/101102278923.html 101102278923 116.4575 39.8764 1111027375273 2018-01-03 21 479.0000 56923 84.1500 2 1 1 1 1 2009 3 6 0.2500 1 0 1 7 2018 1 0 28 L 0.0074 -0.0064 0.0525 45196.4565 11726.5435 0 92269 48964.5491 8.3371 1

Those may be worthwhile to review and revise those prices.

Summary

There are not so many feature values that real estate company can change regarding their product. One of them is renovation, and we have checked how the renovation will contribute to the increase in house price. Our conclusion is that it will unfortunately not bring signicant diffference in the appreciation in housing price, it can be said that it may be brive to cut off renovatio cost as reasonably as possible.

In addition to the renovation factor, we also filter out the houses which prices had better to be reconsidered.